A Walk with SGD

Authors

  • Chen Xing
  • Devansh Arpit
  • Christos Tsirigotis
  • Yoshua Bengio
Abstract

Exploring why stochastic gradient descent (SGD) based optimization methods train deep neural networks (DNNs) that generalize well has become an active area of research. Towards this end, we empirically study the dynamics of SGD when training over-parametrized DNNs. Specifically, we study the DNN loss surface along the trajectory of SGD by interpolating the loss surface between parameters from consecutive iterations and tracking various metrics during training. We find that the loss interpolation between parameters before and after a training update is roughly convex with a minimum (valley floor) in between for most of the training. Based on this and other metrics, we deduce that during most of the training, SGD explores regions in a valley by bouncing off valley walls at a height above the valley floor. This 'bouncing off walls at a height' mechanism helps SGD traverse larger distances for small batch sizes and large learning rates, which we find play qualitatively different roles in the dynamics. While a large learning rate maintains a large height from the valley floor, a small batch size injects noise, facilitating exploration. We find this mechanism is crucial for generalization because the valley floor has barriers, and this exploration above the valley floor allows SGD to quickly travel far away from the initialization point (without being affected by barriers) and find flatter regions, corresponding to better generalization.
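As a rough illustration of the interpolation diagnostic described above, the sketch below evaluates the training loss along the straight line between the parameters before and after a single SGD update. It assumes a PyTorch model; the names interpolate_loss, theta_before, and theta_after are illustrative, not from the paper.

```python
import copy
import torch

def interpolate_loss(model, loss_fn, batch, theta_before, theta_after, steps=20):
    # Evaluate the loss at points (1 - alpha) * theta_before + alpha * theta_after
    # for alpha in [0, 1]. A roughly convex profile with an interior minimum is
    # the "valley floor between consecutive iterates" signature discussed above.
    probe = copy.deepcopy(model)  # scratch copy so the training weights stay intact
    x, y = batch
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        with torch.no_grad():
            for p, a, b in zip(probe.parameters(), theta_before, theta_after):
                p.copy_((1 - alpha) * a + alpha * b)
            losses.append(loss_fn(probe(x), y).item())
    return losses

# Typical use: snapshot parameters around one update, then interpolate.
# theta_before = [p.detach().clone() for p in model.parameters()]
# optimizer.step()
# theta_after = [p.detach().clone() for p in model.parameters()]
# profile = interpolate_loss(model, loss_fn, (x, y), theta_before, theta_after)
```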


Similar Articles

Robust Decentralized Differentially Private Stochastic Gradient Descent

Stochastic gradient descent (SGD) is one of the most widely applied machine learning algorithms in unreliable large-scale decentralized environments. In this type of environment, data privacy is a fundamental concern. The most popular way to investigate this topic is based on the framework of differential privacy. However, many important implementation details and the performance of differentially priv...


Finite-Time Analysis of Projected Langevin Monte Carlo

We analyze the projected Langevin Monte Carlo (LMC) algorithm, a close cousin of projected Stochastic Gradient Descent (SGD). We show that LMC allows sampling in polynomial time from a posterior distribution restricted to a convex body and with concave log-likelihood. This gives the first Markov chain to sample from a log-concave distribution with a first-order oracle, as the existing chains w...
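For concreteness, a minimal sketch of the projected LMC update is shown below: a gradient step on the potential, Gaussian noise scaled by sqrt(2*eta), and a Euclidean projection back onto the convex body. grad_f and project are placeholder callables supplied by the caller, not an API from the paper.

```python
import numpy as np

def projected_lmc(grad_f, project, x0, eta, n_steps, rng=None):
    # Iterate x <- Proj_K(x - eta * grad f(x) + sqrt(2 * eta) * xi),
    # where xi is standard Gaussian noise and Proj_K projects onto
    # the convex body K.
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = project(x - eta * grad_f(x) + np.sqrt(2.0 * eta) * noise)
        samples.append(x.copy())
    return samples

# Example: sample from a standard Gaussian restricted to the unit ball.
# grad_f = lambda x: x                                  # f(x) = ||x||^2 / 2
# project = lambda x: x / max(1.0, np.linalg.norm(x))   # projection onto the ball
# draws = projected_lmc(grad_f, project, np.zeros(2), eta=1e-2, n_steps=10_000)
```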


Protective effect of Viola tricolor and Viola odorata extracts on serum/glucose deprivation-induced neurotoxicity: role of reactive oxygen species

Objective: Oxidative stress plays a key role in the pathophysiology of brain ischemia and neurodegenerative disorders. Previous studies indicated that Viola tricolor and Viola odorata are rich sources of antioxidants. This study aimed to determine whether these plants protect neurons against serum/glucose deprivation (SGD)-induced cell death in an in vitro model of ischemia and neurodegeneration....


The mechanism of neuroprotective effect of Viola odorata against serum/glucose deprivation-induced PC12 cell death

Objective: Oxidative stress is associated with the pathogenesis of brain ischemia and other neurodegenerative disorders. Previous research has shown the antioxidant activity of Viola odorata L. In this project, we studied the neuroprotective and reactive oxygen species (ROS) scavenging activities of the methanol (MeOH) extract and other fractions isolated from ...


Weighted parallel SGD for distributed unbalanced-workload training system

Stochastic gradient descent (SGD) is a popular stochastic optimization method in machine learning. Traditional parallel SGD algorithms, e.g., SimuParallel SGD [1], often require all nodes to have the same performance or to consume equal quantities of data. However, these requirements are difficult to satisfy when the parallel SGD algorithms run in a heterogeneous computing environment; low-perf...
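The weighting idea can be sketched as a simple aggregation rule: instead of a plain average, each node's parameters are weighted, for example by how much data it processed. This is a generic illustration under that assumption, not the exact rule from the cited paper.

```python
import numpy as np

def weighted_aggregate(worker_params, worker_weights):
    # Combine per-worker parameter vectors with normalized weights
    # (e.g., proportional to the number of samples each node consumed),
    # rather than the uniform average used by SimuParallel-style SGD.
    w = np.asarray(worker_weights, dtype=float)
    w = w / w.sum()
    params = np.asarray(worker_params, dtype=float)
    return (w[:, None] * params).sum(axis=0)

# Example: three workers, where the third processed twice as much data.
# theta = weighted_aggregate([theta_0, theta_1, theta_2], [1.0, 1.0, 2.0])
```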



Journal:
  • CoRR

Volume: abs/1802.08770

Pages: -

Year of publication: 2018